Shuffle less for SSE4.1 #137
Conversation
```cpp
pack8(7, 6, 5, 4, 3, 2, 1, 0)};
// We will read from bswap at offsets for fixed format output. Accommodate
// the necessary offset calculations by using char*.
// This is aligned to 64 bits, so we won't cross a cache line when reading,
```
Good eyes! Will correct
The second one, which reads the data immediately written before. This dominated the profile. I will commit a simplification of the evaluation of the new stuff in `float_layout` tonight. It's probably not surprising that the AI's version was a bit too roundabout. I will then split this into, probably, three patches that can be benchmarked in sequence: 1. Move writing
Remove `shift_extra`, make the calculations more understandable.
Staged too early
@vitaut so I made the mistake of benchmarking not only with gcc but also with clang while reworking the patch: the whole thing is a bit of a mirage. With clang this is actually a pessimization, and the speed-up with gcc only brings it up to par with the pessimized clang. So I think this patch is misguided, and the apparent gain is actually due to gcc being led away from some questionable choices it makes in the original version. OTOH, thinking about this, it became clear that I should just do what I had considered but dismissed: put together the string and insert the decimal point in the middle while still inside the SSE register, and only do a single store to memory (plus `extra_digit` and `last_digit`). This seems to work well, and I hope to have it ready for submission later this week.
Thanks for the detailed analysis. This sounds like a good plan and I look forward to the updated version.
In the current code, we shuffle to reverse the byte order and we shuffle to shift the bits. This change combines the two. It also replaces a slow `memmove` for the decimal digits with another shuffle + store, so that the data is never re-read from memory.

This is my first AI-assisted patch to zmij. The AI's main task was to figure out the special cases when moving the writing of the last digit behind the shuffles. Previously, `last_digit` was only sometimes part of the `memmove` and could thus end up in a different position. The AI suggested keeping a table of where it goes, and so I did that. A table-free approach made the compiler add a branch and lose time.

The way I avoided increasing the table size for the various bswap + shift variations, and thus messing up alignments or introducing padding, is quite simple: what we want to do is reverse the order and drop the first N digits, where N depends on whether it's the integral part or the decimals, and on the exponent. We don't care what is appended past the actual digits. So it doesn't actually matter what the last few entries in the shuffle control are: we just keep the same `bswap` shuffle table, read it at an offset, and don't care what the entries at the end end up being. This only works because we reverse at the same time, so it cannot be carried over to the NEON code. The additional bytes are loaded from the same struct, so it's not uninitialized memory and ASAN is happy, but I'm not sure what the standard says about this; I don't think the SIMD memory access is in its scope. BTW, I changed the type of `bswap` to `char *` in order to get simpler index algebra and fewer casts.

Finally, in order to keep everything branch-free, in the case of `dec_exp < 0`, where no decimal point is inserted between the digits, the identical shuffle + store operation is simply repeated. This turned out to be measurably faster than conditionally jumping past it.

Note that the zmij/canada benchmark, given Canada's geography, doesn't contain any numbers < 1, so I created a separate benchmark (not included) that uses random fixed numbers. To my surprise, it was actually faster than the processing of the Canada data. A possible explanation is that it triggers the `last_digit` path far less often, but I didn't check this in detail.

On my Zen 4 this saves 5%; on my Tiger Lake it is performance neutral.
I also asked Claude to try the same thing for NEON, but none of the variations it tried came out faster than what is there right now.
PS: I just saw that I didn't clean this up fully. Hence the non-squashed three-patch series, with one unrelated change that I didn't mean to commit as part of this: I moved the `alignas(64)` annotation from the `static_data` declaration to the `struct data` declaration, the idea being that this way all the pointers that we use remember the alignment. It turns out the assembly with and without this change is the same on x64 (https://godbolt.org/z/Mxz1fbKjY) and ARM64 (https://godbolt.org/z/ecnPzTWhK), but I think it's cleaner, so I'm not going to remove it right now even though it is unrelated.